feat: Impl IncrementalAppendScan by WZhuo · Pull Request #590 · apache/iceberg-cpp

WZhuo · 2026-03-12T10:34:31Z

No description provided.

wgtmac

Note: This review was generated by Gemini.

Summary & Recommendation

Request Changes: There are logic bugs in handling empty table snapshots and a potential hashing/equality discrepancy for manifest files that need to be addressed.

wgtmac · 2026-03-25T15:02:54Z

Mostly looks good. I will merge this after the two remaining issues have been addressed.

Closes #2634.

# Rationale for this change

Adds `IncrementalAppendScan`, which reads the data appended between two snapshots and is the building block for incremental ingestion. Largely a revival of the work in #2235; see #2634 and the previous PRs for motivation.

Split out of #3364 at the reviewers' request, and builds on the now-merged `BaseScan` / `ManifestGroupPlanner` refactor (#3511), so this PR's diff is the append-scan feature alone.

The surface mirrors Iceberg-Java's engine-facing API (snapshot IDs, inclusive/exclusive start, optional start) rather than the narrower Spark read options, since PyIceberg is increasingly used by engines (e.g. Polars). See the API discussion on this PR.

References: https://github.com/apache/iceberg (Iceberg-Java and Spark) and apache/iceberg-cpp#590. Inline review-aid comments (prefixed `[AI reviewer aid]`) point at the relevant reference code.

# API

`Table.incremental_append_scan(...)` returns an `IncrementalAppendScan`; `StagedTable` overrides it to raise, mirroring `scan()`.

The scan reads the rows added by **append** snapshots in `(from, to]`, projected onto the table's current schema; delete / overwrite / replace snapshots in the range (e.g. compaction) are ignored.

The range is set via the factory's Spark-style kwargs or the builder methods, each of which returns a refined copy (like `select()` / `filter()`):

```python
table.incremental_append_scan(
    from_snapshot_id_exclusive=None,  # optional; defaults to the oldest ancestor of `to`
    to_snapshot_id_inclusive=None,    # optional; defaults to the current snapshot
    row_filter=...,
    selected_fields=...,
    case_sensitive=...,
    options=...,
    limit=...,
)

(
    scan.from_snapshot_id_exclusive(id)  # or .from_snapshot_id_inclusive(id)
    .to_snapshot_id_inclusive(id)
)
```

The range is held as public attributes (`from_snapshot_id` + `from_snapshot_inclusive` + `to_snapshot_id`): a single start slot plus an inclusive flag, mirroring Java's `TableScanContext` and consistent with the other scans.

# Changes

- Range resolution mirrors Java's `BaseIncrementalScan`: an unset start scans from the oldest ancestor of the end; an inclusive start resolves to its parent as the exclusive boundary; an exclusive start is validated with `is_parent_ancestor_of`, so an expired start cursor is accepted as long as the lineage still passes through it; the end defaults to the current snapshot; an empty table with no range set scans nothing.
- Planning walks the append-only ancestors in the range, dedups the data manifests whose `added_snapshot_id` is in range (set semantics via `ManifestFile.__eq__` / `__hash__`), and filters manifest entries to `ADDED`-in-range via a new `manifest_entry_filter` on `ManifestGroupPlanner.plan_files`. Compacted (`rewrite_data_files`) output is therefore not picked up, so there is no double counting.
- Projects onto the table's **current** schema (matching Java/C++), so rows written under an older schema in the range get `NULL` for newer columns.
- Adds snapshot helpers `ancestors_between_ids`, `is_ancestor_of`, and `is_parent_ancestor_of`.
- Arrow materialization (`to_arrow` / `to_arrow_batch_reader`) is shared with `DataScan` via small module-level helpers that take the projected schema explicitly, so `BaseScan` stays projection-free (per the #3511 review).

# Out of scope (tracked follow-ups)

- Branch selection (`use_branch`) and per-endpoint ref/tag start & end (`from_ref_*` / `to_ref_*`), the rest of the engine-facing surface Java exposes.
- `count()`, REST server-side planning, and user-facing doc examples (`mkdocs`).
- `dictionary_columns` on `IncrementalAppendScan.to_arrow` / `to_arrow_batch_reader` (added to `DataScan` in #3461; the shared helpers already thread it), kept out to isolate this PR.

# Are these changes tested?

Yes: unit tests (range resolution including unset / inclusive / exclusive and expired start, current-schema projection, builder and `update()` copies, empty table, staged-table guard) and integration tests (append-only, non-append snapshots ignored, compaction not double-counted, schema evolution within range, partition- and metrics-evaluator pruning, disconnected snapshots), plus the `test_incremental_read` provision fixture.

# Are there any user-facing changes?

Yes: the new `Table.incremental_append_scan(...)` API and `IncrementalAppendScan` class. No changes to existing public surface.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
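
For readers who want a feel for the range resolution and manifest dedup described in the Changes notes, here is a minimal, self-contained sketch. It is not the PyIceberg or iceberg-cpp implementation: the `Snapshot` and `ManifestFile` classes, the `append_snapshots_in_range` and `data_manifests_in_range` helpers, and the plain-dict lineage are all hypothetical stand-ins chosen for illustration. It assumes an exclusive start, accepts an expired start as long as a surviving snapshot still names it as its parent, and relies on frozen dataclasses for consistent equality and hashing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Snapshot:
    """Toy snapshot (hypothetical; not the library class)."""
    snapshot_id: int
    parent_id: Optional[int]
    operation: str  # e.g. "append", "overwrite", "replace", "delete"


@dataclass(frozen=True)
class ManifestFile:
    """Toy manifest; frozen=True generates matching __eq__ and __hash__."""
    path: str
    added_snapshot_id: int


def append_snapshots_in_range(
    snapshots: Dict[int, Snapshot],
    to_id: int,
    from_id_exclusive: Optional[int] = None,
) -> List[Snapshot]:
    """Return append snapshots in (from, to], newest first.

    An unset start walks back to the oldest reachable ancestor. A start that
    is no longer present (expired) is still accepted if some ancestor of
    `to_id` records it as its parent.
    """
    selected: List[Snapshot] = []
    found_start = from_id_exclusive is None
    current = snapshots.get(to_id)
    while current is not None:
        if current.snapshot_id == from_id_exclusive:
            found_start = True  # reached the exclusive boundary itself
            break
        if current.operation == "append":
            selected.append(current)
        if from_id_exclusive is not None and current.parent_id == from_id_exclusive:
            found_start = True  # boundary is this snapshot's (possibly expired) parent
            break
        # A missing or None parent ends the walk.
        current = snapshots.get(current.parent_id)
    if not found_start:
        raise ValueError(
            f"snapshot {from_id_exclusive} is not a parent ancestor of {to_id}"
        )
    return selected


def data_manifests_in_range(
    selected: List[Snapshot],
    manifests_by_snapshot: Dict[int, List[ManifestFile]],
) -> Set[ManifestFile]:
    """Dedup manifests whose added_snapshot_id is one of the selected appends."""
    in_range_ids = {s.snapshot_id for s in selected}
    unique: Set[ManifestFile] = set()
    for snap in selected:
        for manifest in manifests_by_snapshot.get(snap.snapshot_id, []):
            if manifest.added_snapshot_id in in_range_ids:
                unique.add(manifest)  # set semantics drop repeats across snapshots
    return unique
```

Because later snapshots typically carry forward manifests written by earlier ones, the same manifest can be listed by several snapshots in the range; collecting into a set only works if equality and hashing agree, which is the concern the review summary raises about manifest files. A manifest added by a non-append snapshot is skipped because its `added_snapshot_id` is not among the selected append IDs.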

WZhuo force-pushed the incremental_append_scan branch 3 times, most recently from 9b822d0 to 4c4ffff, on March 13, 2026 03:45

wgtmac reviewed Mar 20, 2026

Comment thread src/iceberg/table_scan.h

wgtmac reviewed Mar 20, 2026

Comment thread src/iceberg/table_scan.cc Outdated

wgtmac reviewed Mar 20, 2026

Comment thread src/iceberg/table_scan.h Outdated

Comment thread src/iceberg/table_scan.h Outdated

WZhuo force-pushed the incremental_append_scan branch 2 times, most recently from d1bacdf to 5c41a78, on March 23, 2026 03:30

WZhuo added 3 commits March 23, 2026 19:31

feat: Impl IncrementalAppendScan

179cea6

fix: deduplicate manifest files for multi snapshots

f1d259b

refactor: move planfile impl to source file

27d01c9

WZhuo force-pushed the incremental_append_scan branch 2 times, most recently from 3f97b02 to 8e2ad99, on March 24, 2026 06:10

wgtmac requested changes Mar 25, 2026

Comment thread src/iceberg/table_scan.cc

Comment thread src/iceberg/manifest/manifest_list.h

fix: define IncrementalScan's PlanFiles as pure virtual and no need instantiate in header

f3cd338

WZhuo force-pushed the incremental_append_scan branch from 8e2ad99 to f3cd338, on March 26, 2026 10:03

wgtmac approved these changes Mar 26, 2026

wgtmac merged commit cdf05d6 into apache:main Mar 27, 2026
12 checks passed

smaheshwar-pltr mentioned this pull request Mar 27, 2026

Incremental Append Scan apache/iceberg-python#2634

Closed

smaheshwar-pltr mentioned this pull request May 18, 2026

Feature: Incremental Append Scan apache/iceberg-python#3364

Closed

smaheshwar-pltr mentioned this pull request Jun 16, 2026

Feature: Incremental Append Scan apache/iceberg-python#3512

Merged

WZhuo deleted the incremental_append_scan branch July 2, 2026 06:08

wgtmac merged 4 commits into apache:main from WZhuo:incremental_append_scan
